community: FAISS vectorstore - consistent Document id field #27101
Closed
dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files), community (Related to langchain-community), and Ɑ: vector store (Related to vector store module) labels on Oct 4, 2024
…8274) Thank you for contributing to LangChain! Ctrl+F to find instances of `langchain-databricks` and replace with `databricks-langchain`. Additional guidelines: - Make sure optional dependencies are imported within a function. - Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests. - Most PRs should not touch more than one package. - Changes should be backwards compatible. - If you are adding something to community, do not re-import it in langchain. If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. Signed-off-by: Prithvi Kannan <[email protected]>
- **docs: poetry publish** - **x** - **x** - **x** - **x** - **x** - **x** - **x** - **x** - **x**
…ain-ai#28269) - **Description:** Corrected the parameter name in the HuggingFaceEmbeddings documentation under integrations/text_embedding/ from model to model_name to align with the actual code usage in the langchain_huggingface package. - **Issue:** Fixes langchain-ai#28231 - **Dependencies:** None
…ssages` (langchain-ai#28267) We have a test [test_structured_few_shot_examples](https://github.com/langchain-ai/langchain/blob/ad4333ca032033097c663dfe818c5c892c368bd6/libs/standard-tests/langchain_tests/integration_tests/chat_models.py#L546) in standard integration tests that implements a version of tool-calling few shot examples that works with ~all tested providers. The formulation supported by ~all providers is: `human message, tool call, tool message, AI response`. Here we update `langchain_core.utils.function_calling.tool_example_to_messages` to support this formulation. The `tool_example_to_messages` util is undocumented outside of our API reference. IMO, if we are testing that this function works across all providers, it can be helpful to feature it in our guides. The structured few-shot examples we document at the moment require users to implement this function and can be simplified.
langchain-ai#28296) **Description:** Currently, the docstring for `LanceDB.__init__()` provides the default value for `mode`, but not the list of valid values. This PR adds that list to the docstring. **Issue:** N/A **Dependencies:** N/A **Twitter handle:** `@metadaddy` [Leaving as a reminder: If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.]
…chain-ai#28304) Link migration guide first.
- fix import statement for qdrant - issue: langchain-ai#28012
pydantic 2.10 compat for langchain-core
pydantic compat 2.10 for langchain
fix small GOOGLE_API_KEY markdown formatting typo
Adds deprecation notices for Neo4j components moving to the `langchain_neo4j` partner package. - Adds deprecation warnings to all Neo4j-related classes and functions that have been migrated to the new `langchain_neo4j` partner package - Updates documentation to reference the new `langchain_neo4j` package instead of `langchain_community`
) JSONparse, in _validate_metadata_func(), checks the consistency of the _metadata_func() function. To do this, it invokes the function and makes sure it receives a dictionary in response. However, the validation call does not pass the same arguments as later calls, as shown on line 100. This generates errors if, for example, the function is like this:

```python
from typing import Any, Dict

from langchain_community.document_loaders import JSONLoader


# `url` and `file_path` are assumed to be defined elsewhere.
def generate_metadata(json_node: Dict[str, Any], kwargs: Dict[str, Any]) -> Dict[str, Any]:
    return {
        "source": url,
        "row": kwargs["seq_num"],
        "question": json_node.get("question"),
    }


loader = JSONLoader(
    file_path=file_path,
    content_key="answer",
    jq_schema=".[]",
    metadata_func=generate_metadata,
    text_content=False,
)
```

To avoid this, the verification must comply with the specifications. This patch does just that.

---------

Co-authored-by: Eugene Yurtsev <[email protected]>
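The failure mode described above can be illustrated with a small, self-contained sketch. This is not the `JSONLoader` implementation: `validate_with_empty_metadata` and `validate_with_real_keys` are hypothetical helpers, and the assumption that real calls supply `source` and `seq_num` keys is taken only from the example function in this commit message.

```python
from typing import Any, Callable, Dict

MetadataFunc = Callable[[Dict[str, Any], Dict[str, Any]], Dict[str, Any]]


def generate_metadata(json_node: Dict[str, Any], kwargs: Dict[str, Any]) -> Dict[str, Any]:
    # Relies on a key ("seq_num") that real calls are assumed to provide.
    return {"row": kwargs["seq_num"], "question": json_node.get("question")}


def validate_with_empty_metadata(func: MetadataFunc, sample: Dict[str, Any]) -> bool:
    """Hypothetical validator that passes an empty dict, unlike real calls."""
    try:
        return isinstance(func(sample, {}), dict)
    except KeyError:
        # The user function is fine; the validator called it differently.
        return False


def validate_with_real_keys(func: MetadataFunc, sample: Dict[str, Any]) -> bool:
    """Hypothetical validator that mirrors the arguments of a real call."""
    metadata = {"source": "sample.json", "seq_num": 1}
    return isinstance(func(sample, metadata), dict)


sample = {"question": "q", "answer": "a"}
empty_ok = validate_with_empty_metadata(generate_metadata, sample)
real_ok = validate_with_real_keys(generate_metadata, sample)
```

In this sketch the same metadata function is rejected by the first validator and accepted by the second, which is the inconsistency the patch is described as removing.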
…i#25375) community: add hybrid search in opensearch

# Langchain OpenSearch Hybrid Search Implementation

## Implementation of Hybrid Search:

I have taken LangChain's OpenSearch integration to the next level by adding hybrid search capabilities. Building on the existing OpenSearchVectorSearch class, I have implemented Hybrid Search functionality (which combines the best of both keyword and semantic search). This new functionality allows users to harness the power of OpenSearch's advanced hybrid search features without leaving the familiar LangChain ecosystem. By blending traditional text matching with vector-based similarity, the enhanced class delivers more accurate and contextually relevant results. It's designed to seamlessly fit into existing LangChain workflows, making it easy for developers to upgrade their search capabilities.

In implementing the hybrid search for OpenSearch within the LangChain framework, I also incorporated filtering capabilities. It's important to note that according to the OpenSearch hybrid search documentation, only post-filtering is supported for hybrid queries. This means that the filtering is applied after the hybrid search results are obtained, rather than during the initial search process.

**Note:** For the implementation of hybrid search, I strictly followed the official OpenSearch Hybrid search documentation and I took inspiration from https://github.com/AndreasThinks/langchain/tree/feature/opensearch_hybrid_search Thanks Mate!

### Experiments

I conducted a few experiments to verify that the hybrid search implementation is accurate and capable of reproducing the results of both plain keyword search and vector search.

Experiment - 1 Hybrid Search Keyword_weight: 1, vector_weight: 0

I conducted an experiment to verify the accuracy of my hybrid search implementation by comparing it to a plain keyword search. For this test, I set the keyword_weight to 1 and the vector_weight to 0 in the hybrid search, effectively giving full weight to the keyword component. The results from this hybrid search configuration matched those of a plain keyword search, confirming that my implementation can accurately reproduce keyword-only search results when needed. It's important to note that while the results were the same, the scores differed between the two methods. This difference is expected because the plain keyword search in OpenSearch uses the BM25 algorithm for scoring, whereas the hybrid search still performs both keyword and vector searches before normalizing the scores, even when the vector component is given zero weight. This experiment validates that my hybrid search solution correctly handles the keyword search component and properly applies the weighting system, demonstrating its accuracy and flexibility in emulating different search scenarios.

Experiment - 2 Hybrid Search keyword_weight = 0.0, vector_weight = 1.0

For experiment-2, I took the inverse approach to further validate my hybrid search implementation. I set the keyword_weight to 0 and the vector_weight to 1, effectively giving full weight to the vector search component (KNN search). I then compared these results with a pure vector search. The outcome was consistent with my expectations: the results from the hybrid search with these settings exactly matched those from a standalone vector search. This confirms that my implementation accurately reproduces vector search results when configured to do so. As with the first experiment, I observed that while the results were identical, the scores differed between the two methods. This difference in scoring is expected and can be attributed to the normalization process in hybrid search, which still considers both components even when one is given zero weight. This experiment further validates the accuracy and flexibility of my hybrid search solution, demonstrating its ability to effectively emulate pure vector search when needed while maintaining the underlying hybrid search structure.

Experiment - 3 Hybrid Search - balanced keyword_weight = 0.5, vector_weight = 0.5

For experiment-3, I adopted a balanced approach to further evaluate the effectiveness of my hybrid search implementation. In this test, I set both the keyword_weight and vector_weight to 0.5, giving equal importance to keyword-based and vector-based search components. This configuration aims to leverage the strengths of both search methods simultaneously. By setting both weights to 0.5, I intended to create a scenario where the hybrid search would consider lexical matches and semantic similarity equally. This balanced approach is often ideal for many real-world applications, as it can capture both exact keyword matches and contextually relevant results that might not contain the exact search terms.

Kindly verify the notebook for the experiments conducted!

**Notebook:** https://github.com/karthikbharadhwajKB/Langchain_OpenSearch_Hybrid_search/blob/main/Opensearch_Hybridsearch.ipynb

### Instructions to follow for Performing Hybrid Search:

**Step-1: Instantiating OpenSearchVectorSearch Class:**

```python
import os

from langchain_community.vectorstores import OpenSearchVectorSearch

# `embedding_model` is assumed to be defined elsewhere.
opensearch_vectorstore = OpenSearchVectorSearch(
    index_name=os.getenv("INDEX_NAME"),
    embedding_function=embedding_model,
    opensearch_url=os.getenv("OPENSEARCH_URL"),
    http_auth=(os.getenv("OPENSEARCH_USERNAME"), os.getenv("OPENSEARCH_PASSWORD")),
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
)
```

**Parameters:**

1. **index_name:** The name of the OpenSearch index to use.
2. **embedding_function:** The function or model used to generate embeddings for the documents. It's assumed that embedding_model is defined elsewhere in the code.
3. **opensearch_url:** The URL of the OpenSearch instance.
4. **http_auth:** A tuple containing the username and password for authentication.
5. **use_ssl:** Set to False, indicating that the connection to OpenSearch is not using SSL/TLS encryption.
6. **verify_certs:** Set to False, which means the SSL certificates are not being verified. This is often used in development environments but is not recommended for production.
7. **ssl_assert_hostname:** Set to False, disabling hostname verification in SSL certificates.
8. **ssl_show_warn:** Set to False, suppressing SSL-related warnings.

**Step-2: Configure Search Pipeline:**

To initiate hybrid search functionality, you need to configure a search pipeline first.

**Implementation Details:**

This method configures a search pipeline in OpenSearch that:

1. Normalizes the scores from both keyword and vector searches using the min-max technique.
2. Applies the specified weights to the normalized scores.
3. Calculates the final score using an arithmetic mean of the weighted, normalized scores.

**Parameters:**

* **pipeline_name (str):** A unique identifier for the search pipeline. It's recommended to use a descriptive name that indicates the weights used for keyword and vector searches.
* **keyword_weight (float):** The weight assigned to the keyword search component. This should be a float value between 0 and 1. In this example, 0.3 gives 30% importance to traditional text matching.
* **vector_weight (float):** The weight assigned to the vector search component. This should be a float value between 0 and 1. In this example, 0.7 gives 70% importance to semantic similarity.

```python
opensearch_vectorstore.configure_search_pipelines(
    pipeline_name="search_pipeline_keyword_0.3_vector_0.7",
    keyword_weight=0.3,
    vector_weight=0.7,
)
```

**Step-3: Performing Hybrid Search:**

After creating the search pipeline, you can perform a hybrid search using the `similarity_search()` method (or) any methods that are supported by `langchain`. This method combines both keyword-based and semantic similarity searches on your OpenSearch index, leveraging the strengths of both traditional information retrieval and vector embedding techniques.

**Parameters:**

* **query:** The search query string.
* **k:** The number of top results to return (in this case, 3).
* **search_type:** Set to `hybrid_search` to use both keyword and vector search capabilities.
* **search_pipeline:** The name of the previously created search pipeline.

```python
query = "what are the country named in our database?"
top_k = 3
pipeline_name = "search_pipeline_keyword_0.3_vector_0.7"

matched_docs = opensearch_vectorstore.similarity_search_with_score(
    query=query,
    k=top_k,
    search_type="hybrid_search",
    search_pipeline=pipeline_name,
)
matched_docs
```

twitter handle: @iamkarthik98

---------

Co-authored-by: Karthik Kolluri <[email protected]>
Co-authored-by: Eugene Yurtsev <[email protected]>
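The three pipeline steps (min-max normalization, weighting, arithmetic mean) can be sketched in plain Python to show the arithmetic. In the PR this work is done server-side by the OpenSearch search pipeline, so the code below is only an illustration: `min_max_normalize` and `combine` are hypothetical names, treating a document missing from one result list as scoring 0 is an assumption, and dividing by the sum of the weights is an assumed reading of "arithmetic mean of the weighted, normalized scores".

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def combine(
    keyword: Dict[str, float],
    vector: Dict[str, float],
    keyword_weight: float,
    vector_weight: float,
) -> Dict[str, float]:
    """Weighted arithmetic mean of the two normalized score sets."""
    kw = min_max_normalize(keyword)
    vec = min_max_normalize(vector)
    total = keyword_weight + vector_weight
    combined = {}
    for doc_id in set(kw) | set(vec):
        # Assumption: a document absent from one list contributes 0 there.
        weighted = keyword_weight * kw.get(doc_id, 0.0) + vector_weight * vec.get(doc_id, 0.0)
        combined[doc_id] = weighted / total
    return combined


# Made-up raw scores: BM25-like for keyword, similarity-like for vector.
keyword_scores = {"a": 10.0, "b": 5.0, "c": 0.0}
vector_scores = {"a": 0.2, "b": 0.8, "c": 0.5}
combined = combine(keyword_scores, vector_scores, keyword_weight=0.3, vector_weight=0.7)
ranking = sorted(combined, key=combined.get, reverse=True)
```

Because both score sets are rescaled before weighting, the combined values differ from the raw BM25 or k-NN scores even when one weight is 0, which is consistent with the score differences reported in the experiments.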
…gchain-ai#28374) **Description:** This PR introduces a `model` alias for the embedding classes that contain the attribute `model_name`, to ensure consistency across the codebase, as suggested by a moderator in a previous PR. The change aligns the usage of attribute names across the project (see for example [here](https://github.com/langchain-ai/langchain/blob/65deeddd5dfec5d51f33ebc961f09c2e47a8f064/libs/partners/groq/langchain_groq/chat_models.py#L304)). **Issue:** This PR addresses the suggestion from the review of issue langchain-ai#28269. **Dependencies:** None --------- Co-authored-by: Eugene Yurtsev <[email protected]> Co-authored-by: Erick Friis <[email protected]>
…rkdownifyTransformer` (langchain-ai#27866)

# Description

Implements the `atransform_documents` method for `MarkdownifyTransformer` using the `asyncio` built-in library for concurrency. Note that this is mainly for API completeness when working with async frameworks rather than for performance, since the `markdownify` function is not I/O bound because it works with `Document` objects already in memory.

# Issue

Fixes langchain-ai#27865

# Dependencies

No new dependencies added, but [`markdownify`](https://github.com/matthewwithanm/python-markdownify) is required since this PR updates the `markdownify` integration.

# Tests and docs

- Tests added
- I did not modify the docstrings since they already described the basic functionality, and [the API docs also already included a description](https://python.langchain.com/api_reference/community/document_transformers/langchain_community.document_transformers.markdownify.MarkdownifyTransformer.html#langchain_community.document_transformers.markdownify.MarkdownifyTransformer.atransform_documents). If it would be helpful, I would be happy to update the docstrings and/or the API docs.

# Lint and test

- [x] format
- [x] lint
- [x] test

I ran formatting with `make format`, linting with `make lint`, and confirmed that tests pass using `make test`. Note that some unit tests pass in CI but may fail when running `make test`. Those unit tests are:

- `test_extract_html` (and `test_extract_html_async`)
- `test_strip_tags` (and `test_strip_tags_async`)
- `test_convert_tags` (and `test_convert_tags_async`)

The reason for the difference is that there are trailing spaces when the tests are run in the CI checks, and no trailing spaces when run with `make test`. I ensured that the tests pass in CI, but they may fail with `make test` because the expected output includes those trailing spaces.

---------

Co-authored-by: Erick Friis <[email protected]>
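As a rough illustration of wrapping a synchronous, in-memory transform in an async method, here is a minimal sketch. It is not the code from the PR: the text above only says `asyncio` is used, so the choice of `asyncio.to_thread` with `asyncio.gather` is an assumption, and `Doc` and `transform_one` are hypothetical stand-ins for `Document` and the `markdownify` call.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain `Document`."""

    page_content: str


def transform_one(doc: Doc) -> Doc:
    # Placeholder for a synchronous conversion such as markdownify.
    return Doc(page_content=doc.page_content.strip().upper())


async def atransform_documents(docs: Sequence[Doc]) -> List[Doc]:
    # Run each synchronous call in a worker thread (Python 3.9+);
    # asyncio.gather returns results in the order the tasks were passed.
    tasks = [asyncio.to_thread(transform_one, doc) for doc in docs]
    return list(await asyncio.gather(*tasks))


result = asyncio.run(atransform_documents([Doc(" a "), Doc("b")]))
```

Since the work is not I/O bound, a wrapper like this mainly keeps the event loop responsive and offers an awaitable API; it should not be expected to speed up the conversion itself.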
**PR title**: "community: fix PDF Filter Type Error" - **Description:** fix PDF Filter Type Error - **Issue:** fixes langchain-ai#27153 - **Dependencies:** none - [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. --------- Co-authored-by: Erick Friis <[email protected]>
hey there! I believe this conflicted with a change adding a filtering syntax to FAISS. Would you be interested in re-implementing just the |
Added the complete LangChain tutorial playlist from the Total Technology Zonne channel. Each video in the playlist focuses on one specific use case with a hands-on demo. The tutorials are suitable for all levels. --------- Co-authored-by: Erick Friis <[email protected]>
Co-authored-by: Jesse Schumacher <[email protected]> Co-authored-by: Jesse S <[email protected]> Co-authored-by: dylan <[email protected]>
Issue: There is an ambiguity about W&B integrations: there are two existing provider pages. Fix: Added the "root" W&B provider page, with references to the documentation on the W&B site. Cleaned up formatting in the existing pages. Added one more integration reference. --------- Co-authored-by: Erick Friis <[email protected]> Co-authored-by: Eugene Yurtsev <[email protected]>
- **Description:** Adds a helper that renders documents with the GraphVectorStore metadata fields to Graphviz for visualization. This is helpful for understanding and debugging. --------- Co-authored-by: Erick Friis <[email protected]>
Thank you for contributing to LangChain! - [x] **PR title**: langchain: add URL parameter to ChatDeepInfra class - [x] **PR message**: add URL parameter to ChatDeepInfra class - **Description:** This PR introduces a url parameter to the ChatDeepInfra class in LangChain, allowing users to specify a custom URL. Previously, the URL for the DeepInfra API was hardcoded to "https://stage.api.deepinfra.com/v1/openai/chat/completions", which caused issues when the staging endpoint was not functional. The _url method was updated to return the value from the url parameter, enabling greater flexibility and addressing the problem. --------- Co-authored-by: Erick Friis <[email protected]>
Bump unstructured to pick up resolution of Unstructured-IO/unstructured#3795
… values in index_to_docstore_id, implement get_by_ids method
…to faiss-doc-ids
dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files) label and removed the size:M (This PR changes 30-99 lines, ignoring generated files) label on Dec 15, 2024
Labels: community (Related to langchain-community), size:XXL (This PR changes 1000+ lines, ignoring generated files), Ɑ: vector store (Related to vector store module)
- Set the `id` field of documents in a `FAISS` docstore to be consistent with values in `index_to_docstore_id`
- Implement the `get_by_ids` method for the `FAISS` vectorstore
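The two changes can be pictured with a toy, dictionary-backed store. This is a sketch under stated assumptions, not the `FAISS` vectorstore code from this PR: `ToyStore` and its simplified `Document` are hypothetical, there is no real vector index, and silently skipping unknown IDs in `get_by_ids` is an assumed behaviour.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class Document:
    """Simplified stand-in for a LangChain `Document`."""

    page_content: str
    id: Optional[str] = None


@dataclass
class ToyStore:
    """Hypothetical store keeping a docstore and an index-to-id mapping."""

    docstore: Dict[str, Document] = field(default_factory=dict)
    index_to_docstore_id: Dict[int, str] = field(default_factory=dict)

    def add_texts(self, texts: Sequence[str], ids: Optional[Sequence[str]] = None) -> List[str]:
        new_ids = list(ids) if ids is not None else [str(uuid.uuid4()) for _ in texts]
        start = len(self.index_to_docstore_id)
        for offset, (text, doc_id) in enumerate(zip(texts, new_ids)):
            # The stored document carries the same id used as the docstore key
            # and recorded in index_to_docstore_id.
            self.docstore[doc_id] = Document(page_content=text, id=doc_id)
            self.index_to_docstore_id[start + offset] = doc_id
        return new_ids

    def get_by_ids(self, ids: Sequence[str]) -> List[Document]:
        # Assumption: unknown ids are skipped rather than raising.
        return [self.docstore[i] for i in ids if i in self.docstore]


store = ToyStore()
store.add_texts(["foo", "bar"], ids=["a", "b"])
found = store.get_by_ids(["b", "missing", "a"])
```

With the `id` field kept in sync, a document returned from a search can be passed straight back to `get_by_ids` or to a delete call without a separate lookup through `index_to_docstore_id`.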